Abstract — This paper presents a semi-supervised method for categorizing
human actions using multiple visual features. The proposed algorithm
simultaneously learns multiple features from a small number of labeled
videos, and automatically utilizes data distributions between labeled and
unlabeled data to boost the recognition performance. Shared structural
analysis is applied in our approach to discover a common subspace shared by
each type of feature. In the subspace, the proposed algorithm is able to
characterize more discriminative information of each feature type.
Additionally, data distribution information of each type of feature is
preserved. These attributes make our algorithm robust for action
recognition, especially when only limited labeled training samples are
provided. Extensive experiments have been conducted on both choreographed
and realistic video datasets, including KTH, YouTube action and UCF50.
Experimental results show that our method outperforms several
state-of-the-art algorithms. Most notably, much better performance is
achieved when there are only a few labeled training samples.

Index  Terms — Human  action  recognition,  multiple  feature 
learning,  semi-supervised  learning,  shared  structural  analysis. 


I.  Introduction 

People are creating and sharing personal videos that contain actions more
easily than ever, owing to phenomenal developments in cloud computing and
storage technologies. As a result, there is heavy demand for an efficient
and effective mechanism of automatic action video annotation that can
facilitate retrieval, indexing and classification. Supervised classifiers,
which use only labeled training samples, have been extensively used to
address these problems. Unfortunately, labeled data are notoriously hard to
obtain in the real world. By contrast, collecting unlabeled data is often
effortless. When confronted with huge amounts of unlabeled data, manual
annotation, which is tedious and time-consuming, should be the last choice.
The goal of this work is to use multiple feature fusion to study human
action recognition in video data when label information is extremely
insufficient.

Human action recognition has been widely studied in computer vision [4]. The
common approach is to extract features from video data and to train a
classifier from the features with class information. Generally, features for
action video can be divided into two groups: global features [5], [6] and
local features [7], [8]. Since correlations between low-level features may
provide distinctive information, more research attention [9], [10] has been
paid to local feature correlation mining to improve recognition results. In
[11], shared structural analysis is applied to exploit multi-label
correlations. Similar ideas of shared structure learning have also been
applied to many domain adaptation applications [12], [13], in which a
transformation is learnt from the original feature space of both source and
target domains to a subspace. This subspace is shared by all domains, which
means features in every domain can be transformed into this shared subspace
and then jointly learnt within it. Armed with this technique, cross-view
action recognition problems have been well investigated in [14].

As mentioned above, the scarcity of labeled training samples may cause a
supervised learning model to overfit. This work mainly focuses on
recognizing actions represented by multiple features when the label
information is limited. Even though semi-supervised learning and its
variants have been proposed to tackle the problem of insufficient labeled
data for training, ways to learn multiple features in a semi-supervised
framework for action recognition, with little impact from noise and
outliers, have been largely ignored so far. Besides, exploiting the shared
structural information has proven beneficial to action recognition in [15].
Thus, attention should also be paid to analyzing the structural information
shared by action features.

To address the aforementioned challenges in action video annotation, this
paper proposes a novel semi-supervised approach that not only exploits the
feature correlations within each feature type, but also automatically
leverages multiple feature fusion. First of all, semi-supervised methods,
which are able to make use of both labeled and unlabeled data for training,
are more suitable than supervised learning approaches for real-world data.
The recognition accuracy can be improved by combining a small amount of
labeled data with a large amount of unlabeled data. Secondly, it is assumed
that similar actions, which are represented by the occurrence frequency of
visual words, should share some common components in representation. For
example, the similar arm motions in Tennis-Swing and Golf-Swing may have
locally common components. We propose to characterize such a high-level
semantic pattern through the low-level action features by applying shared
structural analysis to the Bag-of-Words (BoW) representation. By directly
exploiting the correlations between low-level features, similar high-level
semantic patterns are discovered between similar types of action videos.
Thirdly, motivated by the latest research on video analysis that utilizes
multiple features, the framework is further extended to work with multiple
features to achieve better classification performance. Generally speaking,
the semi-supervised action video annotation is modeled separately for each
type of feature, while the correlations between different features are
unveiled simultaneously.

In the proposed framework, the training videos comprise both labeled and
unlabeled samples. Multiple features are extracted from both the training
and testing videos. For the $v$-th feature type, a graph model is first
constructed using the distribution of the $v$-th type of feature. Building
upon this graph, virtual labels of the unlabeled data can be generated by
label propagation, during which the shared structural analysis of the
features is applied to uncover the feature correlations. This makes the
results more reliable. For each feature type, the consistency of nearby
points is separately preserved, and the label prediction of the unlabeled
data in the training set is made by joint consideration of the global
consistency of the multiple features. In this way, a multiple feature
classifier is trained for action recognition. The contributions of this
paper can be summarized as follows:

•  We apply a semi-supervised learning framework which analyzes structures
shared by BoW features by uncovering a low-dimensional subspace based on
each feature type.

•  The proposed framework considers the global and local structural
consistency to train a discriminating classifier for annotation.

•  To maximize the holistic performance, the framework runs in a multiple
feature based manner with noise handling.

•  Compared with other methods, the proposed method demonstrates better
performance, especially when label information is quite scarce.

The rest of this paper is organized as follows: Related work is reviewed in
Section II. The proposed framework is elaborated in Section III, followed by
experiments in Section IV. Lastly, Section V concludes this paper.

II.  Related  Work 

In  this  section,  we  briefly  review  the  related  research  on 
multiple  feature  learning,  semi-supervised  learning  and  shared 
structural  analysis. 

A.  Multiple  Feature  Learning 

An object can be described by different features that provide different
discriminating information. In light of this, research attention on feature
fusion for video analysis has grown over recent years. Feature fusion
methods can be categorized into three strategies: early fusion, late fusion
and multi-stage fusion.

In the early fusion strategy, a multimodal representation, which is
constructed from heterogeneous features, is used for classification tasks. A
common way to benefit from multiple features is to directly concatenate
different types of features to form a larger feature vector. For instance,
to classify human actions, Sun et al. [16] use the concatenation of local
descriptors and holistic features as input to the Support Vector Machine
(SVM). Though this simple fusion scheme performs well, it usually incurs the
computational burden of processing larger feature vectors. Meanwhile, it
does not guarantee improved performance: the independence among
heterogeneous features may degrade the holistic performance.

In contrast with early fusion methods, late fusion learns multiple features
separately and builds a multimodal representation by combining the learned
models. In other words, late fusion occurs after independent learning for
each type of feature. Farquhar et al. [17] propose an SVM-based late fusion
algorithm, namely SVM-2K, to learn two types of features in an object
classification task. An extension from this supervised learning algorithm to
a semi-supervised setting has been proposed by Li et al. [18]. However, one
drawback of late fusion methods is the expensive cost of learning, because
separate learning is carried out for each feature type and extra learning is
then conducted for the fusion. Moreover, most late fusion approaches do not
take the correlations of each type of feature into account, because the
fusion occurs afterwards.

In addition to early and late fusion strategies, the multi-stage fusion
scheme has also been investigated recently. For example, Natarajan et al.
[19] first combine a large set of visual and acoustic features using
multiple kernel learning (MKL) as the early fusion scheme. In the next
stage, two different late fusion strategies are applied to the MKL-based
subsystems. The published results show additional performance improvements
when multi-stage fusion is used.

B.  Graph-Based  Semi-Supervised  Learning

The motivation of semi-supervised learning stems from the prohibitive cost
of manually annotating a large amount of data. As one of its important
branches, graph-based semi-supervised learning has attracted much research
interest [20]. The main paradigm of graph-based semi-supervised learning is
to utilize relations between labeled and unlabeled data by exploring the
manifold structure. Since graph-based semi-supervised methods are
discriminative, they have been successfully applied in a number of
applications. Zhou et al. [21] propose a graph-based semi-supervised method
that learns with local and global consistency, namely LGC. Specifically, a
regularization framework that iteratively predicts the label information of
unlabeled samples has been developed. In [22], a graph constructed with a
spatial Markov kernel integrates intra-image context. Afterwards,
graph-based semi-supervised learning propagates labels to the unlabeled
images on the graph. Active learning is then combined to achieve interactive
classification. Ma et al. [23] use a graph-based semi-supervised framework
incorporating feature selection to learn classification information from
real-world image data. In addition, graph-based semi-supervised learning has
also been applied in a number of video content-based applications, including
video retrieval [24], video annotation [25], action recognition [15] and
multiple person tracking [26].
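As an illustration of the graph-based paradigm, the LGC method of Zhou et al. [21] is commonly stated in a closed form, F* = (I - aS)^{-1} Y, where S is the symmetrically normalized affinity matrix. The sketch below is a minimal NumPy version under that assumption; the function name `lgc_propagate`, the toy graph and the value of `alpha` are illustrative choices and do not come from this paper.

```python
import numpy as np

def lgc_propagate(W, Y, alpha=0.99):
    """Closed-form LGC: F* = (I - alpha * S)^{-1} Y, S = D^{-1/2} W D^{-1/2}.

    W: n x n symmetric affinity matrix with zero diagonal (every node is
       assumed to have at least one edge).
    Y: n x c initial label matrix (all-zero rows for unlabeled nodes).
    Returns the predicted class index of every node.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    F = np.linalg.solve(np.eye(W.shape[0]) - alpha * S, Y)
    return F.argmax(axis=1)

# Toy graph: two disjoint triangles, one labeled node in each.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # node 0 is labeled as class 0
Y[3, 1] = 1.0   # node 3 is labeled as class 1
labels = lgc_propagate(W, Y)
```

Labels spread only along graph edges, so each unlabeled node inherits the class of the labeled node in its own triangle.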

C.  Shared  Structure  Analysis 

Recently, shared structure analysis has been applied to multi-label learning
[11], [27] and multi-task learning [28]. Taking the correlations between
different labels into account, Ando et al. [28] have proposed an approach to
minimize the total loss of a set of predicting functions
$\{f_l(x)\}_{l=1}^{c}$. Such a classifier is a linear combination of one
classifier in the original feature space and another classifier in a
low-dimensional subspace projected by a transformation matrix $\Theta$. The
classification problem is then converted into an optimization problem with
the following objective function:


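$$\min_{v_l, w_l, \Theta}\ \sum_{l=1}^{c}\left(\frac{1}{n_l}\sum_{i=1}^{n_l} loss\big(f_l(x_i^l), y_i^l\big) + r(v_l, w_l) + \mu\,\Omega(f_l)\right) \quad \text{s.t. } \Theta^T\Theta = I, \tag{1}$$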


where $n_l$ is the number of samples of the $l$-th class and
$f_l(x_i^l) = v_l^T x_i^l + p_l^T \Theta^T x_i^l$. $loss(\cdot)$ is the
least squares loss function. $r(\cdot)$ and $\Omega(\cdot)$ are
regularization functions using the Frobenius norm. Note that $y_i^l$ is the
ground truth label of the datum $x_i$, which indicates whether $x_i$ belongs
to the $l$-th category. Ando et al. claim that if multiple tasks are
correlated in a multi-task learning problem, significant benefit can be
obtained from a common structure shared by multiple predictors. Their
experimental results also demonstrate that shared structure learning in the
linearly combined predictor helps to extract the underlying correlations
between tasks. In a follow-up work [11], Ji et al. point out that it is
essential to exploit the correlation information contained in different
labels, and propose a combined predictive function for multi-label
classification. This function consists of representations in the original
feature space as well as representations in a shared low-dimensional
subspace. As a result, the correlation information is added to a
conventional multi-label classification framework by using this joint
predictive function.


III.  The  Proposed  Approach 

This section begins with the formulation of the proposed algorithm for
action video annotation. Our method incorporates several techniques
including multiple feature learning, graph-based semi-supervised learning,
shared subspace analysis, the $\ell_{2,1}$-norm loss function, and manifold
learning. It is named Multiple Feature Correlation Uncovering (MFCU).
Following this, we present a detailed solution of how to obtain the
classifier.

A.  Formulation 

In this work, we borrow the idea of structural learning in [11], [27] and
exploit the correlations among different visual words by discovering the
structural information shared by low-level features. If we properly exploit
such a shared structure, a more discriminative classifier for action
recognition can be obtained. Specifically, we jointly take into account the
original feature space and the shared structural subspace through the
following function:


$$f(X) = X^T V + X^T Q P = X^T W, \tag{2}$$

where $W = V + QP$, $W \in \mathbb{R}^{d\times c}$. $Q$ is a transformation
matrix which reflects the low-dimensional subspace shared by different
features. $V$ and $P$ are two weight matrices in the original feature space
and the low-dimensional subspace, respectively. Building upon (2), Ji et al.
[11] have proposed to learn the shared subspace by incorporating a least
squares loss function. Their approach explores the shared subspace between
different tasks and is easy to implement. One drawback is that the least
squares loss function is sensitive to outliers. Therefore, we apply the
$\ell_{2,1}$-norm, which is more robust [29], to the loss function and
obtain the following objective function:


$$\min_{W,P,Q}\ \|X^T W - Y\|_{2,1} + \alpha\|W\|_F^2 + \beta\|W - QP\|_F^2 \quad \text{s.t. } Q^T Q = I, \tag{3}$$

where $\alpha$ and $\beta$ are regularization parameters. $\|W\|_F^2$
controls the complexity of the model to avoid overfitting. Similarly,
$\|W - QP\|_F^2$ acts as a penalty term that is driven towards zero when
$\beta$ is large. Since $V = W - QP$ by (2), penalizing this term shrinks
the weight $V$ of the representation in the original space while the weight
$P$ of the representation in the transformed (shared) subspace grows, and
vice versa. Thus, $\|W - QP\|_F^2$ regularizes the shared structure
information. Through (3), we aim to construct a robust classifier that is
both discriminative in the original feature space and capable of discovering
the correlations between visual words in the transformed low-dimensional
subspace. Through such a joint classifier, classification performance can be
further improved [15].
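To make the robustness argument concrete, the following minimal sketch evaluates the $\ell_{2,1}$-norm and the value of objective (3). The helper names `l21_norm` and `objective_3` and the toy matrices are illustrative assumptions; the shapes follow the text ($X$ is $d \times n$, $Y$ is $n \times c$, $W$ is $d \times c$, $Q$ is $d \times r$, $P$ is $r \times c$).

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm: the sum of the Euclidean norms of the rows of M."""
    return float(np.sqrt((M ** 2).sum(axis=1)).sum())

def objective_3(X, Y, W, Q, P, alpha, beta):
    """Value of objective (3) for given W, Q, P (no optimization is done)."""
    residual = X.T @ W - Y
    return (l21_norm(residual)
            + alpha * np.linalg.norm(W, "fro") ** 2
            + beta * np.linalg.norm(W - Q @ P, "fro") ** 2)

# One residual row is an outlier; scaling it by 10 scales the l2,1 loss by 10
# but scales a squared (least squares) loss by 100.
R = np.array([[3.0, 4.0], [0.0, 0.0]])
l21_small, l21_large = l21_norm(R), l21_norm(10 * R)
sq_small, sq_large = (R ** 2).sum(), ((10 * R) ** 2).sum()

# Tiny instance of (3): zero residual, only the two regularizers contribute.
X = np.eye(2)
Y = np.eye(2)
W = np.eye(2)
Q = np.array([[1.0], [0.0]])
P = Q.T @ W
value = objective_3(X, Y, W, Q, P, alpha=0.5, beta=2.0)
```

Because each sample contributes its residual norm rather than its squared norm, a single badly fitted video influences (3) linearly instead of quadratically.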

One limitation of the framework in [11] is that only a single type of
feature is applied. Performance can be improved by applying multiple
features. Another limitation is that this method relies on fully labeled
training data. Our method is extended to a semi-supervised approach due to
its advantage in saving labeling costs while simultaneously achieving good
performance. Most semi-supervised learning methods assume that nearby points
are likely to have the same labels. Specifically, data points which can be
connected via a path through high density regions on the data manifold are
likely to have the same label. In fact, information about density and
manifold is inadequate in the real world because of the scarcity of labeled
data. To deal with this problem, a graph is utilized to approximate the
density and manifold information for semi-supervised learning in the
framework. To begin with, the multiple feature training data set is
redefined as $X_v = [X_v^l, X_v^u]$, $1 \le v \le m$, where $m$ is the
number of feature types. For each feature type, $X_v^l$ and $X_v^u$ are the
two subsets of data with and without labels, respectively. Inspired by [11],
[15], [27], [29], we propose a joint multiple feature learning framework as
follows:


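$$\min_{F, F_v, W_v, Q_v, P_v, \lambda_v}\ \mathrm{tr}\Big(F^T \sum_{v=1}^{m} \lambda_v^{\gamma} L_v F\Big) + \mathrm{tr}\big((F - Y)^T U (F - Y)\big) + \mu_1 \sum_{v=1}^{m}\Big(\alpha\|W_v\|_F^2 + \beta\|W_v - Q_v P_v\|_F^2 + \|X_v^T W_v - F_v\|_{2,1}\Big) + \mu_2 \sum_{v=1}^{m} \|F - F_v\|_F^2$$
$$\text{s.t. } Q_v^T Q_v = I,\quad \sum_{v=1}^{m} \lambda_v = 1,\quad \lambda_v \in [0, 1], \tag{4}$$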

where $\mu_1$, $\mu_2$, $\alpha$ and $\beta$ are regularization parameters.
$\|W_v\|_F^2$ and $\|W_v - Q_v P_v\|_F^2$ play the same roles as their
counterparts in (3) w.r.t. each feature type. $F$ is the global label
prediction, while $F_v$ is the label prediction of the $v$-th feature type.
The term $\min_{F, F_v} \|F - F_v\|_F^2$ reflects the philosophy that
predictions based on each feature type should be consistent with the global
prediction. The definition of the Laplacian matrix of the $v$-th feature
type, $L_v$, can be found in [15]. Note that we add the term
$\sum_{v=1}^{m} \lambda_v^{\gamma} L_v$ to balance contributions from
structural information with respect to each feature type [30]. $U$ is a
diagonal selection matrix that is defined as:


$$U_{ii} = \begin{cases} \infty & \text{if } x_i \text{ is labeled;} \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
The shared structure learning was initially proposed for multi-label
learning in [11]. In our multiple feature learning framework, the idea of
uncovering the shared structure is applied to exploiting shared information
among different features. Moreover, this framework preserves independent
structural information from each feature, which contributes to a better
understanding of action videos.


B.  Optimization 

We use an alternating approach to optimize our objective function. First, we
fix $\lambda_v = 1/m$ and $F$ to optimize the other variables. The initial
$F$ is the one that optimizes the following objective:

$$\min_{F}\ \mathrm{tr}\big((F - Y)^T U (F - Y)\big) + \mathrm{tr}\Big(F^T \sum_{v=1}^{m} \lambda_v^{\gamma} L_v F\Big) \tag{6}$$

The initial value of $F$ is obtained by setting the derivative of (6) w.r.t.
$F$ to 0 as follows:

$$F = \Big(U + \sum_{v=1}^{m} \lambda_v^{\gamma} L_v\Big)^{-1} U Y$$
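The initialization above can be sketched numerically as follows. Two points are assumptions rather than details from the text: the Laplacian is taken to be the unnormalized Laplacian of a symmetrized k-nearest-neighbor graph (the paper defers the actual definition of $L_v$ to [15]), and the infinite entries of $U$ in (5) are replaced by a large constant `big`. The names `knn_laplacian` and `initial_F` are hypothetical.

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - A of a symmetrized k-NN graph.

    X is d x n (one column per sample). This graph construction is an
    assumption made for illustration only.
    """
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # n x n distances
    np.fill_diagonal(sq, np.inf)                              # exclude self
    nearest = np.argsort(sq, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    A = np.maximum(A, A.T)                                    # symmetrize
    return np.diag(A.sum(axis=1)) - A

def initial_F(laplacians, lambdas, gamma, labeled_mask, Y, big=1e6):
    """F = (U + sum_v lambda_v^gamma L_v)^{-1} U Y, with `big` standing in
    for the infinite diagonal entries of the selection matrix U."""
    L = sum((lam ** gamma) * Lv for lam, Lv in zip(lambdas, laplacians))
    U = np.diag(np.where(labeled_mask, big, 0.0))
    return np.linalg.solve(U + L, U @ Y)

# Toy data: two well separated groups on a line, one labeled sample each,
# and two identical "feature types" with equal weights lambda_v = 1/m.
X_toy = np.array([[0, 1, 2, 3, 4, 5, 100, 101, 102, 103, 104, 105]], dtype=float)
labeled = np.zeros(12, dtype=bool)
labeled[[0, 6]] = True
Y_toy = np.zeros((12, 2))
Y_toy[0, 0] = 1.0
Y_toy[6, 1] = 1.0
L_toy = knn_laplacian(X_toy, k=2)
F0 = initial_F([L_toy, L_toy], [0.5, 0.5], gamma=2, labeled_mask=labeled, Y=Y_toy)
pred = F0.argmax(axis=1)
```

The system is solvable as long as every connected component of the combined graph contains at least one labeled sample.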


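After fixing $F$ and $\lambda_v$, the optimization problem becomes:

$$\min_{F_v, W_v, Q_v, P_v}\ \mu_1 \sum_{v=1}^{m}\Big(\alpha\|W_v\|_F^2 + \beta\|W_v - Q_v P_v\|_F^2 + \|X_v^T W_v - F_v\|_{2,1}\Big) + \mu_2 \sum_{v=1}^{m} \|F - F_v\|_F^2 \quad \text{s.t. } Q_v^T Q_v = I \tag{7}$$

By setting the derivative of the above objective w.r.t. $P_v$ to 0, we have:

$$P_v = Q_v^T W_v \tag{8}$$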

According to [31], a general $\ell_{2,1}$-norm minimization problem
represented as:

$$\min_{U}\ f(U) + \sum_{k} \|A_k U + B_k\|_{2,1} \quad \text{s.t. } U \in \mathcal{C}$$

can be solved by iteratively solving the following problem:

$$\min_{U}\ f(U) + \sum_{k} \mathrm{tr}\big((A_k U + B_k)^T D_k (A_k U + B_k)\big) \quad \text{s.t. } U \in \mathcal{C}$$

Therefore, after substituting $P_v$ in (8), the objective problem in (7) can
be solved by iteratively solving the following problem:

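$$\min_{F_v, W_v, Q_v}\ \mu_1 \sum_{v=1}^{m}\Big(\mathrm{tr}\big((X_v^T W_v - F_v)^T D_v (X_v^T W_v - F_v)\big) + \mathrm{tr}\big(W_v^T((\alpha + \beta) I - \beta Q_v Q_v^T) W_v\big)\Big) + \mu_2 \sum_{v=1}^{m} \|F - F_v\|_F^2 \quad \text{s.t. } Q_v^T Q_v = I \tag{9}$$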

where $D_v$ is a diagonal matrix with $D_v^{ii} = \frac{1}{2\|z_v^i\|_2}$,
$Z_v = X_v^T W_v - F_v$ and $Z_v = [z_v^1, \ldots, z_v^n]^T \in
\mathbb{R}^{n\times c}$. Note that in practice, $\|z_v^i\|_2$ could be very
close to zero. In this case, we can follow the traditional regularization
approach and define the diagonal elements of $D_v$ as
$D_v^{ii} = \frac{1}{2\|z_v^i\|_2 + \varsigma}$, where $\varsigma$ is a
small constant. When $\varsigma \to 0$, it is easy to see that
$\frac{1}{2\|z_v^i\|_2 + \varsigma}$ approximates $\frac{1}{2\|z_v^i\|_2}$.
By setting the derivative of the above function w.r.t. $W_v$ to 0, we get:

$$W_v = (M_v - \beta Q_v Q_v^T)^{-1} X_v D_v F_v, \tag{10}$$

where $M_v = X_v D_v X_v^T + (\alpha + \beta) I$. Substituting $W_v$ into
(9), the objective function becomes:

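$$\min_{F_v, Q_v}\ \mu_1 \sum_{v=1}^{m}\Big(\mathrm{tr}(F_v^T D_v F_v) - \mathrm{tr}(F_v^T D_v X_v^T N_v^{-1} X_v D_v F_v)\Big) + \mu_2 \sum_{v=1}^{m} \|F - F_v\|_F^2 \quad \text{s.t. } Q_v^T Q_v = I, \tag{11}$$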

where $N_v = M_v - \beta Q_v Q_v^T$. By setting the derivative of the above
objective function w.r.t. $F_v$ to 0, we have:

$$F_v = \mu_2 G_v F, \tag{12}$$

where $G_v = (A_v - \mu_1 D_v X_v^T N_v^{-1} X_v D_v)^{-1}$ and
$A_v = \mu_1 D_v + \mu_2 I$. According to the Woodbury matrix identity [32],
$G_v$ can be written as:

$$G_v = A_v^{-1} + \mu_1 E_v^T (N_v - \mu_1 X_v D_v A_v^{-1} D_v X_v^T)^{-1} E_v, \tag{13}$$

where $E_v = X_v D_v A_v^{-1}$. By substituting $F_v$ in (12) into (9), we
have:


$$\min_{Q_v}\ -\sum_{v=1}^{m} \mathrm{tr}(F^T G_v F) \tag{14}$$

Substituting $G_v$ in (13) into (14), the optimization problem is equivalent
to the following one:

$$\max_{Q_v}\ \sum_{v=1}^{m} \mathrm{tr}\big(F^T E_v^T (N_v - \mu_1 X_v D_v A_v^{-1} D_v X_v^T)^{-1} E_v F\big) \tag{15}$$

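In (15), the term $(N_v - \mu_1 X_v D_v A_v^{-1} D_v X_v^T)^{-1}$, according
to the Woodbury matrix identity [32], is rewritten as:

$$J_v^{-1} + \beta J_v^{-1} Q_v (I - \beta Q_v^T J_v^{-1} Q_v)^{-1} Q_v^T J_v^{-1}, \tag{16}$$

where $J_v = M_v - \mu_1 X_v D_v A_v^{-1} D_v X_v^T$. As $J_v$ is
independent of $Q_v$, and $I - \beta Q_v^T J_v^{-1} Q_v = Q_v^T (I - \beta
J_v^{-1}) Q_v$ under the constraint $Q_v^T Q_v = I$, the optimization
problem reduces to the following objective function:

$$\max_{Q_v}\ \mathrm{tr}\big(F^T E_v^T J_v^{-1} Q_v (Q_v^T (I - \beta J_v^{-1}) Q_v)^{-1} Q_v^T J_v^{-1} E_v F\big) \quad \text{s.t. } Q_v^T Q_v = I \tag{17}$$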

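For two arbitrary matrices $A$ and $B$ of compatible sizes,
$\mathrm{tr}(AB) = \mathrm{tr}(BA)$. We therefore rewrite (17) as follows:

$$\max_{Q_v}\ \mathrm{tr}\big((Q_v^T (I - \beta J_v^{-1}) Q_v)^{-1} Q_v^T J_v^{-1} E_v F F^T E_v^T J_v^{-1} Q_v\big) \quad \text{s.t. } Q_v^T Q_v = I \tag{18}$$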
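The two applications of the Woodbury matrix identity, in (13) and (16), can be checked numerically. The sketch below builds small random instances of the quantities defined above for a single feature type and compares both sides of each identity; the dimensions, the random seed and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 4, 6, 2                       # feature dim, samples, subspace dim
alpha, beta, mu1, mu2 = 1.0, 0.5, 2.0, 3.0

X = rng.standard_normal((d, n))                      # X_v
D = np.diag(rng.uniform(0.5, 2.0, size=n))           # D_v, positive diagonal
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))     # Q_v with Q^T Q = I

A = mu1 * D + mu2 * np.eye(n)                        # A_v
A_inv = np.linalg.inv(A)
M = X @ D @ X.T + (alpha + beta) * np.eye(d)         # M_v
N = M - beta * Q @ Q.T                               # N_v
E = X @ D @ A_inv                                    # E_v
J = M - mu1 * X @ D @ A_inv @ D @ X.T                # J_v

# (13): G_v computed directly and through the Woodbury form.
G_direct = np.linalg.inv(A - mu1 * D @ X.T @ np.linalg.inv(N) @ X @ D)
G_woodbury = A_inv + mu1 * E.T @ np.linalg.inv(
    N - mu1 * X @ D @ A_inv @ D @ X.T) @ E

# (16): (N_v - mu1 X D A^{-1} D X^T)^{-1} = (J_v - beta Q Q^T)^{-1}.
J_inv = np.linalg.inv(J)
lhs_16 = np.linalg.inv(J - beta * Q @ Q.T)
rhs_16 = J_inv + beta * J_inv @ Q @ np.linalg.inv(
    np.eye(r) - beta * Q.T @ J_inv @ Q) @ Q.T @ J_inv
```

The practical benefit of (13) is that the $n \times n$ inverse defining $G_v$ is traded for a $d \times d$ inverse plus the inverse of the diagonal matrix $A_v$.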


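Let $K_v$ and $C_v$ be:

$$K_v = I - \beta J_v^{-1} \tag{19}$$

$$C_v = J_v^{-1} E_v F F^T E_v^T J_v^{-1} \tag{20}$$

After substituting $K_v$ and $C_v$ into (18), the objective function is
reformulated as:

$$\max_{Q_v}\ \mathrm{tr}\big((Q_v^T K_v Q_v)^{-1} Q_v^T C_v Q_v\big) \quad \text{s.t. } Q_v^T Q_v = I \tag{21}$$

Thus, the above objective function can be solved by the eigen-decomposition
of $K_v^{-1} C_v$. Next, we fix $P_v$, $W_v$ and $F_v$ to optimize $F$ and
$\lambda_v$ through the following objective function:

$$\min_{F, \lambda_v}\ \mathrm{tr}\big((F - Y)^T U (F - Y)\big) + \mathrm{tr}\Big(F^T \sum_{v=1}^{m} \lambda_v^{\gamma} L_v F\Big) + \mu_2 \sum_{v=1}^{m} \|F - F_v\|_F^2 \quad \text{s.t. } \sum_{v=1}^{m} \lambda_v = 1,\ \lambda_v \in [0, 1] \tag{22}$$

After fixing $\lambda_v = 1/m$ and setting the derivative of (22) w.r.t. $F$
to 0, it becomes:

$$F = \Big(\sum_{v=1}^{m} \lambda_v^{\gamma} L_v + U + m\mu_2 I\Big)^{-1} \Big(U Y + \mu_2 \sum_{v=1}^{m} F_v\Big) \tag{23}$$

Now $\lambda_v$ is the only variable to be solved. From the objective
function, we notice that $\lambda_v$ is only related to:

$$\mathrm{tr}\Big(F^T \sum_{v=1}^{m} \lambda_v^{\gamma} L_v F\Big), \quad \sum_{v=1}^{m} \lambda_v = 1,\quad \lambda_v \in [0, 1] \tag{24}$$

By using a Lagrange multiplier $\xi$, we convert the problem to a Lagrange
function as:

$$Lag(\lambda_v, \xi) = \mathrm{tr}\Big(F^T \sum_{v=1}^{m} \lambda_v^{\gamma} L_v F\Big) - \xi\Big(\sum_{v=1}^{m} \lambda_v - 1\Big) \tag{25}$$

Setting the derivative w.r.t. $\lambda_v$ and $\xi$ to 0 respectively, we
have:

$$\gamma \lambda_v^{\gamma - 1}\, \mathrm{tr}(F^T L_v F) - \xi = 0, \qquad \sum_{v=1}^{m} \lambda_v - 1 = 0 \tag{26}$$

We thus obtain $\lambda_v$ by solving the following equation:

$$\lambda_v = \frac{\big(1 / \mathrm{tr}(F^T L_v F)\big)^{1/(\gamma - 1)}}{\sum_{v=1}^{m} \big(1 / \mathrm{tr}(F^T L_v F)\big)^{1/(\gamma - 1)}} \tag{27}$$


Algorithm 1: The MFCU algorithm.

Input:
    The training data represented by $m$ types of features
    $X_1, \ldots, X_m \in \mathbb{R}^{d\times n}$;
    The training data labels $Y \in \mathbb{R}^{n\times c}$;
    Parameters $\alpha$, $\beta$, $\mu_1$ and $\mu_2$.
Output:
    Converged $W_v \in \mathbb{R}^{d\times c}$.

1:  Compute the graph Laplacian matrices $L_v \in \mathbb{R}^{n\times n}$;
2:  Compute the selection matrix $U \in \mathbb{R}^{n\times n}$;
3:  Initialize $W_v \in \mathbb{R}^{d\times c}$ and $F_v \in \mathbb{R}^{n\times c}$ randomly;
4:  Initialize $\lambda_v = 1/m$ and $F = (U + \sum_{v=1}^{m} \lambda_v^{\gamma} L_v)^{-1} U Y$;
5:  repeat
        Compute $Z_v \in \mathbb{R}^{n\times c}$ as: $Z_v = X_v^T W_v - F_v$;
        Compute the diagonal matrices $D_v$ as:
            $D_v = \mathrm{diag}\big(\frac{1}{2\|z_v^1\|_2}, \ldots, \frac{1}{2\|z_v^n\|_2}\big)$;
        Compute $Q_v$, $F_v$, $W_v$ and $P_v$ according to (21), (12), (10)
        and (8), respectively;
        Update $F$ according to (23);
        Update $\lambda_v$ according to (27);
    until convergence;
6:  Return $W_v$.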
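The closed-form weight update in (27) can be sketched as follows. The function name `update_lambda` and the toy Laplacians are illustrative assumptions; the sketch also assumes $\gamma > 1$ and strictly positive traces so that the expression is well defined.

```python
import numpy as np

def update_lambda(F, laplacians, gamma):
    """Feature-type weights from (27).

    F: n x c global label prediction.
    laplacians: list of n x n graph Laplacians L_v, one per feature type.
    gamma: exponent of lambda_v in (4); must be larger than 1.
    """
    traces = np.array([np.trace(F.T @ Lv @ F) for Lv in laplacians])
    weights = (1.0 / traces) ** (1.0 / (gamma - 1.0))
    return weights / weights.sum()

# Toy example: the second graph is three times "rougher" with respect to F.
F = np.array([[1.0], [0.0]])
L1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
L2 = 3.0 * L1
lam_gamma2 = update_lambda(F, [L1, L2], gamma=2)
lam_gamma10 = update_lambda(F, [L1, L2], gamma=10)
```

A feature type whose graph agrees better with the current prediction (smaller $\mathrm{tr}(F^T L_v F)$) receives a larger weight, and a larger $\gamma$ pushes the weights towards the uniform value $1/m$.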


Consequently, an iterative algorithm is proposed to solve the objective
function in Algorithm 1. The proposed iterative method in Algorithm 1 can be
verified to converge by the following theorem.

Theorem 1: The objective function value shown in (4) monotonically decreases
in each iteration until convergence using the iterative approach in
Algorithm 1.¹
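The per-feature updates inside the loop of Algorithm 1 can be sketched for a single feature type as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `inner_update`, the smoothing constant `eps` (playing the role of $\varsigma$), and the way (21) is solved (a Cholesky-based symmetric reduction of the eigenproblem for $K_v^{-1} C_v$ followed by a QR step to enforce $Q_v^T Q_v = I$) are choices made here; it also assumes $\alpha > 0$ so that $K_v$ is positive definite.

```python
import numpy as np

def inner_update(X, F, W, Fv, r, alpha, beta, mu1, mu2, eps=1e-8):
    """One pass over (21), (12), (10) and (8) for one feature type.

    X: d x n features, F: n x c global prediction,
    W: d x c and Fv: n x c current iterates, r: shared subspace dimension.
    """
    d, n = X.shape
    Z = X.T @ W - Fv
    D = np.diag(1.0 / (2.0 * np.linalg.norm(Z, axis=1) + eps))   # D_v
    A = mu1 * D + mu2 * np.eye(n)                                # A_v
    A_inv = np.diag(1.0 / np.diag(A))
    M = X @ D @ X.T + (alpha + beta) * np.eye(d)                 # M_v
    E = X @ D @ A_inv                                            # E_v
    J_inv = np.linalg.inv(M - mu1 * X @ D @ A_inv @ D @ X.T)     # J_v^{-1}
    K = np.eye(d) - beta * J_inv                                 # (19)
    C = J_inv @ E @ F @ F.T @ E.T @ J_inv                        # (20)
    # (21): top-r eigenvectors of K^{-1} C via K = Lk Lk^T, then orthonormalize
    # (the trace-ratio objective is unchanged by an invertible change of basis).
    Lk_inv = np.linalg.inv(np.linalg.cholesky(K))
    _, vecs = np.linalg.eigh(Lk_inv @ C @ Lk_inv.T)
    Q, _ = np.linalg.qr(Lk_inv.T @ vecs[:, -r:])
    N = M - beta * Q @ Q.T                                       # N_v
    G = np.linalg.inv(A - mu1 * D @ X.T @ np.linalg.solve(N, X @ D))
    Fv_new = mu2 * G @ F                                         # (12)
    W_new = np.linalg.solve(N, X @ D @ Fv_new)                   # (10)
    P = Q.T @ W_new                                              # (8)
    return Q, Fv_new, W_new, P, D

rng = np.random.default_rng(0)
d, n, c, r = 5, 8, 3, 2
X = rng.standard_normal((d, n))
F = rng.standard_normal((n, c))
W0 = rng.standard_normal((d, c))
Fv0 = rng.standard_normal((n, c))
alpha = beta = mu1 = mu2 = 1.0
Q, Fv, W, P, D = inner_update(X, F, W0, Fv0, r, alpha, beta, mu1, mu2)

# Stationarity residuals of (9) w.r.t. W_v and F_v for the fixed D_v and Q_v.
res_W = X @ D @ (X.T @ W - Fv) + ((alpha + beta) * np.eye(d) - beta * Q @ Q.T) @ W
res_F = mu1 * D @ (Fv - X.T @ W) + mu2 * (Fv - F)
```

Both residuals vanish up to rounding, which confirms that the closed forms (10) and (12) jointly satisfy the first-order conditions of (9) once $D_v$ and $Q_v$ are fixed.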


IV.  Experiments 

In this section, we first introduce the action video datasets, followed by
the features used and the compared methods. Lastly, extensive experiments
are conducted to evaluate our approach, and the experimental results are
reported and discussed.

¹Proof can be found at https://sites.google.com/site/homepageofsenwang/.

A.  Datasets  and  Features

In the experiments, three action video datasets are used: the KTH dataset
[1], the YouTube action dataset [2] and the UCF50 dataset [3]. The KTH
actions dataset [1] records six categories of actions. Each action is
performed by 25 subjects under four different scenarios. In total, KTH
contains 599 video clips (2391 sequences). The YouTube action dataset [2]
collects 1600 action video clips of 11 categories from Youtube.com. This
dataset is much more challenging than KTH due to large variations in camera
motion, viewpoint, background, etc. The UCF50 action dataset [3] is an
extension of the YouTube action dataset from 11 to 50 categories. In total,
it has 6681 video clips with the same resolution as the YouTube action
dataset.

According to [33], the Harris3D interest point detector [7] and HOG/HOF
descriptors [34] have shown promising performance for action recognition.
Besides, the MoSIFT feature [8], which treats video spatial information and
temporal information separately, offers more robustness on real-world data,
e.g. surveillance videos. These two features are extracted from all video
data. The Bag-of-Words (BoW) model is used to represent the videos due to
its popularity in the field of human action recognition. Technically, we
follow the same setting utilized in [33] and randomly select two groups of
100,000 training features from HOG/HOF and MoSIFT, respectively. The
unsupervised clustering algorithm, i.e. $k$-means, is applied to build two
codebooks for these two features. To increase the precision, we randomly
initialize $k$-means 10 times and choose the centers with the lowest error
as the codebook. The sizes of the two codebooks are empirically and
uniformly set to 1000. For each video, the BoW model is utilized to build
two histograms, one for each of the two features.
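The histogram step can be sketched as follows, assuming a codebook has already been learned by $k$-means. The function name `bow_histogram`, the tiny 2-D codebook and the L1 normalization are illustrative assumptions; the text above does not state how the histograms are normalized, and the real codebooks have 1000 entries.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Bag-of-Words histogram of one video for one feature type.

    descriptors: N x D local descriptors extracted from the video.
    codebook:    K x D cluster centers (visual words).
    Returns a length-K histogram, L1-normalized (normalization is an
    assumption made here).
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    sq = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = sq.argmin(axis=1)                       # nearest visual word
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, -0.2], [9.5, 10.2], [0.3, 0.1], [10.1, 9.9]])
hist = bow_histogram(descriptors, codebook)
```

In the setting described above, this would be run once per video with the HOG/HOF codebook and once with the MoSIFT codebook, giving the two feature matrices $X_1$ and $X_2$.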

B.  Compared  Methods  and  Experimental  Setup 

To evaluate the performance of our framework, the proposed algorithm is
compared to six state-of-the-art methods: SVM with the $\chi^2$ kernel [34],
TaylorBoost (TBoost) [35], Semi-supervised Feature Correlation Mining (SFCM)
[15], Semi-supervised Discriminative Trace Ratio analysis (SDTR) [36],
simpleMKL [37] and SVM-2K [17]. SVM, TBoost and simpleMKL are three
supervised state-of-the-art classification algorithms. In particular,
SVM-$\chi^2$ has been widely applied in human action recognition due to its
prominent performance with the BoW model. SFCM and SDTR are two
semi-supervised algorithms. SVM-2K is a classic two-type feature learning
algorithm which only deals with two types of features. An explicit feature
map [38] that approximates the $\chi^2$ kernel is applied to the data for
SDTR, SFCM, TBoost, SVM-2K, as well as our approach. SVM-$\chi^2$ and
simpleMKL use their default kernels on the original data.

For the KTH action dataset, we use the standard data partition
provided by the authors of the dataset: a training set (eight
persons), a validation set (eight persons) and a test set (nine
persons). For the YouTube action dataset and the UCF50 action
dataset, we randomly split each dataset into training and testing
sets. The comparison setting follows the convention of
semi-supervised learning approaches. Specifically, the training
set contains both labeled and unlabeled data, and the testing set
is not available during the training phase. Denote by $c$ the
number of classes in each dataset ($c = 6$, 11 and 50 for KTH,
YouTube and UCF50, respectively). We randomly sample $m$ labeled
videos ($m = 1$, 3, 5, 10 and 15) per category in the training set,
resulting in $1 \times c$, $3 \times c$, $5 \times c$, $10 \times c$ and
$15 \times c$ randomly labeled videos, with the remaining training
videos left unlabeled. The experiments are conducted on ten groups
of randomly generated training and testing sets for all the
methods, and average results are reported.
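
The labeling protocol above can be sketched as follows. The function name, the seed handling and the index-based return format are assumptions introduced for this illustration.

```python
import random
from collections import defaultdict

def split_labeled_unlabeled(train_labels, m, seed=0):
    """Pick m labeled videos per class; the rest of the training set is unlabeled.

    train_labels: list with the class of each training video.
    Returns (labeled_indices, unlabeled_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(train_labels):
        by_class[label].append(idx)

    labeled = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < m:
            raise ValueError(f"class {label!r} has fewer than {m} videos")
        labeled.extend(rng.sample(indices, m))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(train_labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled
```

Repeating the call with ten different seeds and averaging the resulting scores mirrors the ten randomly generated groups mentioned in the text.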

In the proposed algorithm, the parameter $k$, which specifies
the number of nearest neighbors used to compute the Laplacian
matrix, is set to 5. $P$, the dimensionality of the shared
structural subspace, and $\gamma$ are empirically set to $c - 1$ and
10, respectively, as the results are not sensitive to them.
Additionally, we tune the parameters $\alpha$, $\beta$ and $\mu_1$
(footnote 2) over $\{10^{-4}, 10^{-2}, 1, 10^{2}, 10^{4}\}$. For
SVM-$\chi^2$, SFCM, SDTR, TBoost, simpleMKL and SVM-2K, we also
tune their parameters over the same range, using the validation
set for KTH and 5-fold cross validation for the other two
datasets. For SFCM, SDTR, TBoost, simpleMKL and SVM-$\chi^2$,
multiple features are concatenated to form a larger feature
vector. For MFCU and SVM-2K, the features are learned
separately. Besides accuracy, mean average precision (MAP) is
used as a second evaluation metric in the experiments.
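
As a reminder of how the second metric is usually computed, the sketch below implements non-interpolated average precision per class and averages it over classes. The text does not spell out the exact MAP variant, so this definition and the function names are assumptions.

```python
def average_precision(scores, relevant):
    """Non-interpolated AP of one class.

    scores:   prediction score of each test video for this class.
    relevant: True where the video actually belongs to this class.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_matrix, labels, classes):
    """Mean of the per-class APs; score_matrix[i][j] scores video i for classes[j]."""
    aps = []
    for j, cls in enumerate(classes):
        scores = [row[j] for row in score_matrix]
        relevant = [label == cls for label in labels]
        aps.append(average_precision(scores, relevant))
    return sum(aps) / len(aps)
```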

C.  Experimental  Results 

Extensive experiments have been conducted on the three
datasets in three rounds, and the proposed method has been
compared with the others using two measures.
First, we compare the proposed method with approaches that
use only a single type of feature. Except for our multiple
feature learning approach, each compared method is run with
the STIP and MoSIFT features separately, and the results are
compared in Fig. 1. Note that SVM-2K is not included in this
comparison because it leverages two different features
simultaneously. Next, all approaches are compared using
multiple features; the results, in terms of average accuracy and
mean average precision, are given in Tables I and II. Lastly,
the impact of shared structural analysis in the framework and
the convergence behavior are shown in Figs. 4 and 2,
respectively.

From Fig. 1, it is observed that MFCU outperforms the other
approaches, which use only one type of feature. This demonstrates
that using multiple features is beneficial. In terms of average
accuracy and mean average precision, our method is consistently
the best on both the choreographed data (KTH) and the
real-world data (YouTube and UCF50). The gains are largest
when only a few labeled training samples are available. For
example, in the $1 \times c$ case (one labeled sample per class)
on the KTH dataset, the accuracy and MAP of our approach are
58.24% and 49.93%, respectively, roughly twice those of TBoost,
SVM-$\chi^2$ and SDTR. Compared to the second best competitor,
SFCM, our multiple feature learning algorithm still has an
advantage in both accuracy (0.5824 vs. 0.5639) and MAP
(0.4993 vs. 0.4735).
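
The "about two times" comparison can be checked directly from the $1 \times c$ KTH entries of Tables I and II. The short script below only divides the reported means; the dictionary layout and function name are illustrative choices.

```python
# Mean accuracy and MAP at 1 x c on KTH, copied from Tables I and II.
ours = {"accuracy": 0.5824, "map": 0.4993}
baselines = {
    "TBoost": {"accuracy": 0.2572, "map": 0.2307},
    "SVM-chi2": {"accuracy": 0.3122, "map": 0.2845},
    "SDTR": {"accuracy": 0.2978, "map": 0.2764},
}

def ratios(ours, baselines):
    """Ratio of our score to each baseline score, per metric."""
    return {
        name: {metric: ours[metric] / scores[metric] for metric in ours}
        for name, scores in baselines.items()
    }

for name, r in ratios(ours, baselines).items():
    print(f"{name}: accuracy x{r['accuracy']:.2f}, MAP x{r['map']:.2f}")
```

All six ratios fall between roughly 1.75 and 2.27, which is what the comparison in the text refers to.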

2 For $\mu_1 (D_v - D_v X_v^T N_v^{-1} X_v D_v) + \mu_2 I$ in (11), $\mu_2$ should be
no less than the absolute value of the smallest eigenvalue of
$\mu_1 (D_v - D_v X_v^T N_v^{-1} X_v D_v) + \mu_2 I$ to guarantee that the quadratic
form of $F_v$ is positive semi-definite.
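
The reasoning behind footnote 2 can be spelled out as a short derivation. Here $A$ is shorthand (not used in the original) for the matrix to which $\mu_2 I$ is added, and the sketch assumes that $A$ is symmetric, so that it has a real eigendecomposition with smallest eigenvalue $\lambda_{\min}$.

```latex
A = Q \Lambda Q^{T}, \quad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)
\;\Longrightarrow\;
A + \mu_2 I = Q (\Lambda + \mu_2 I) Q^{T},
\\
\lambda_i + \mu_2 \;\ge\; \lambda_{\min} + |\lambda_{\min}| \;\ge\; 0
\quad \text{for all } i \text{ whenever } \mu_2 \ge |\lambda_{\min}| .
```

Under that assumption, shifting by any $\mu_2 \ge |\lambda_{\min}|$ makes every eigenvalue non-negative, which is the positive semi-definiteness the footnote asks for.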
[Fig. 1 legend: SDTR-S, simpleMKL-S, SFCM-S, SVM-$\chi^2$-S, TBoost-S, SDTR-M, simpleMKL-M, SFCM-M, SVM-$\chi^2$-M, TBoost-M, Ours]

Fig. 1. Performance comparisons on the three datasets w.r.t. different numbers of labeled training data. Note that each compared method has been run using
both STIP and MoSIFT features. For example, SVM-$\chi^2$-S and SVM-$\chi^2$-M denote SVM with the $\chi^2$ kernel applied to the STIP and MoSIFT features, respectively.


TABLE I

Performance Comparison (Accuracy)

Dataset  Labeled        SDTR             simpleMKL        SFCM (Early Fusion)  SVM-2K           SVM-$\chi^2$     TBoost           Ours
KTH      $1 \times c$   0.2978 ± 0.0468  0.4737 ± 0.0833  0.5639 ± 0.0349      0.5451 ± 0.0491  0.3122 ± 0.0526  0.2572 ± 0.0248  0.5824 ± 0.0373
KTH      $3 \times c$   0.5518 ± 0.0565  0.6204 ± 0.0662  0.6563 ± 0.0304      0.6447 ± 0.0457  0.5228 ± 0.0399  0.4014 ± 0.0601  0.6614 ± 0.0379
KTH      $5 \times c$   0.6607 ± 0.0175  0.7092 ± 0.0072  0.7198 ± 0.0223      0.7087 ± 0.0167  0.6656 ± 0.0285  0.5745 ± 0.0569  0.7271 ± 0.0150
KTH      $10 \times c$  0.7557 ± 0.0423  0.7590 ± 0.0366  0.7768 ± 0.0263      0.7731 ± 0.0229  0.7578 ± 0.0383  0.6889 ± 0.0606  0.7838 ± 0.0211
KTH      $15 \times c$  0.7766 ± 0.0554  0.7914 ± 0.0242  0.8324 ± 0.0260      0.8065 ± 0.0349  0.8269 ± 0.0207  0.7620 ± 0.0162  0.8440 ± 0.0115
YouTube  $1 \times c$   0.2262 ± 0.0296  0.2400 ± 0.0125  0.2924 ± 0.0355      0.2789 ± 0.0318  0.2405 ± 0.0366  0.1498 ± 0.0279  0.3383 ± 0.0278
YouTube  $3 \times c$   0.3442 ± 0.0266  0.3261 ± 0.0257  0.3898 ± 0.0220      0.4097 ± 0.0210  0.3348 ± 0.0188  0.2392 ± 0.0332  0.4634 ± 0.0185
YouTube  $5 \times c$   0.4227 ± 0.0226  0.3928 ± 0.0288  0.4772 ± 0.0184      0.5011 ± 0.0319  0.4292 ± 0.0161  0.2627 ± 0.0148  0.5308 ± 0.0346
YouTube  $10 \times c$  0.5447 ± 0.0140  0.5036 ± 0.0116  0.5801 ± 0.0135      0.6023 ± 0.0335  0.5333 ± 0.0259  0.3019 ± 0.0175  0.6142 ± 0.0113
YouTube  $15 \times c$  0.6254 ± 0.0137  0.5638 ± 0.0165  0.6431 ± 0.0153      0.6722 ± 0.0122  0.5860 ± 0.0124  0.3468 ± 0.0210  0.6941 ± 0.0162
UCF50    $1 \times c$   0.1156 ± 0.0133  0.1638 ± 0.0116  0.2104 ± 0.0195      0.2205 ± 0.0297  0.1406 ± 0.0179  0.0890 ± 0.0193  0.2364 ± 0.0165
UCF50    $3 \times c$   0.2582 ± 0.0042  0.2889 ± 0.0109  0.3347 ± 0.0166      0.3649 ± 0.0104  0.2889 ± 0.0070  0.1517 ± 0.0051  0.3708 ± 0.0160
UCF50    $5 \times c$   0.3346 ± 0.0116  0.3749 ± 0.0105  0.4503 ± 0.0064      0.4617 ± 0.0063  0.3851 ± 0.0106  0.1976 ± 0.0087  0.4780 ± 0.0057
UCF50    $10 \times c$  0.5307 ± 0.0166  0.4700 ± 0.0071  0.5407 ± 0.0120      0.5629 ± 0.0036  0.5106 ± 0.0173  0.2362 ± 0.0112  0.5878 ± 0.0100
UCF50    $15 \times c$  0.6318 ± 0.0059  0.5307 ± 0.0048  0.5965 ± 0.0035      0.6155 ± 0.0139  0.5901 ± 0.0067  0.2736 ± 0.0093  0.6464 ± 0.0055

In Tables I and II, even though all approaches use two
features, MFCU still performs better than all the compared
methods. Specifically, MFCU outperforms all fully supervised
methods (SVM-2K, SVM-$\chi^2$, TBoost and simpleMKL). This is
because supervised learning algorithms cannot train a decent
classifier from insufficient label information. By contrast,
our approach benefits from semi-supervised learning, which can
utilize both labeled and unlabeled data. Compared with the two
semi-supervised methods (SDTR and SFCM), the improvements
come from different sources: 1. MFCU has a more sophisticated
fusion strategy than SFCM. In our late fusion strategy, local and
global consistency are considered together, which augments the
gains from feature fusion; 2. Compared with SDTR, the shared
structural learning and the $\ell_{2,1}$-norm offer advantages
in feature correlation mining and noise handling. From
Tables I and II, it is also found that the performance of all
algorithms rises as the number of labeled training samples
increases. Meanwhile, the performance differences between our
method and the others decrease on KTH. The differences, by
contrast, remain noticeable on YouTube and UCF50. We thus
conclude that our method is robust across different kinds of data
as the amount of training data varies.

To validate our claim that the proposed iterative algorithm
monotonically decreases the objective function value in (4)
until convergence, experiments have been conducted on all the
datasets. The number of labeled training samples is set to
$15 \times c$ for each dataset, and the parameters are set to the
median value of the tuned range. The results in Fig. 2 show that,
with Algorithm 1, the objective function value decreases
monotonically and converges after only a few iterations.
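
A check of this kind can be automated on the recorded objective values. The sketch below is illustrative only: the function names, the relative tolerance and the sample values are hypothetical and are not taken from the experiments.

```python
def is_monotonically_decreasing(values, tol=1e-12):
    """True if no objective value exceeds its predecessor (up to tol)."""
    return all(cur <= prev + tol for prev, cur in zip(values, values[1:]))

def iterations_to_converge(values, rel_tol=1e-4):
    """First iteration whose relative decrease is at most rel_tol, else None."""
    for t in range(1, len(values)):
        prev, cur = values[t - 1], values[t]
        if abs(prev - cur) <= rel_tol * max(abs(prev), 1e-12):
            return t
    return None

# Hypothetical objective values recorded after each iteration.
objective = [10.0, 4.0, 3.2, 3.05, 3.0499]
monotone = is_monotonically_decreasing(objective)
converged_at = iterations_to_converge(objective)
```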
TABLE II

Performance Comparison (MAP)

Dataset  Labeled        SDTR             simpleMKL        SFCM             SVM-2K           SVM-$\chi^2$     TBoost           Ours
KTH      $1 \times c$   0.2764 ± 0.0326  0.3965 ± 0.0659  0.4735 ± 0.0489  0.4544 ± 0.0606  0.2845 ± 0.0377  0.2307 ± 0.0135  0.4993 ± 0.0547
KTH      $3 \times c$   0.4648 ± 0.0480  0.5114 ± 0.0646  0.5447 ± 0.0318  0.5523 ± 0.0491  0.4435 ± 0.0408  0.3734 ± 0.0510  0.5623 ± 0.0366
KTH      $5 \times c$   0.5657 ± 0.0129  0.6022 ± 0.0117  0.5998 ± 0.0320  0.6160 ± 0.0343  0.5690 ± 0.0275  0.4915 ± 0.0640  0.6167 ± 0.0217
KTH      $10 \times c$  0.6634 ± 0.0436  0.6625 ± 0.0426  0.6635 ± 0.0302  0.6010 ± 0.0223  0.6633 ± 0.0434  0.5774 ± 0.0681  0.6860 ± 0.0251
KTH      $15 \times c$  0.6908 ± 0.0650  0.7104 ± 0.0265  0.7335 ± 0.0119  0.6695 ± 0.0652  0.7492 ± 0.0189  0.6617 ± 0.0143  0.7512 ± 0.0068
YouTube  $1 \times c$   0.2145 ± 0.0167  0.2229 ± 0.0181  0.2423 ± 0.0148  0.2458 ± 0.0289  0.2191 ± 0.0174  0.1747 ± 0.0072  0.2631 ± 0.0211
YouTube  $3 \times c$   0.2855 ± 0.0180  0.2680 ± 0.0169  0.3171 ± 0.0216  0.3226 ± 0.0284  0.2844 ± 0.0104  0.2093 ± 0.0251  0.3557 ± 0.0223
YouTube  $5 \times c$   0.3443 ± 0.0128  0.2974 ± 0.0365  0.3643 ± 0.0214  0.3788 ± 0.0245  0.3305 ± 0.0144  0.2134 ± 0.0042  0.3950 ± 0.0341
YouTube  $10 \times c$  0.4289 ± 0.0190  0.3748 ± 0.0129  0.4492 ± 0.0233  0.4693 ± 0.0339  0.4066 ± 0.0230  0.2416 ± 0.0173  0.4916 ± 0.0189
YouTube  $15 \times c$  0.4863 ± 0.0113  0.4251 ± 0.0227  0.5034 ± 0.0133  0.5264 ± 0.0136  0.4452 ± 0.0112  0.2706 ± 0.0162  0.5387 ± 0.0149
UCF50    $1 \times c$   0.0844 ± 0.0087  0.0996 ± 0.0083  0.1117 ± 0.0126  0.1202 ± 0.0184  0.0913 ± 0.0144  0.0649 ± 0.0079  0.1341 ± 0.0100
UCF50    $3 \times c$   0.1704 ± 0.0016  0.1617 ± 0.0133  0.1892 ± 0.0112  0.2008 ± 0.0096  0.1749 ± 0.0043  0.0911 ± 0.0025  0.2166 ± 0.0145
UCF50    $5 \times c$   0.2228 ± 0.0105  0.2126 ± 0.0096  0.2671 ± 0.0040  0.2758 ± 0.0093  0.2439 ± 0.0065  0.1139 ± 0.0051  0.2961 ± 0.0091
UCF50    $10 \times c$  0.3789 ± 0.0172  0.2789 ± 0.0086  0.3595 ± 0.0102  0.3755 ± 0.0138  0.3357 ± 0.0145  0.1321 ± 0.0042  0.3935 ± 0.0127
UCF50    $15 \times c$  0.4393 ± 0.0039  0.3288 ± 0.0034  0.4204 ± 0.0085  0.4410 ± 0.0138  0.4100 ± 0.0070  0.1451 ± 0.0039  0.4582 ± 0.0083

[Fig. 2 panels (a), (b) and (c); horizontal axis: iteration times.]

Fig. 2. The convergence curves of the objective function values in (4) obtained with Algorithm 1 on the three datasets.

Fig. 4. The variation of accuracy w.r.t. the parameter $\beta$ with fixed $\alpha$ and $\mu_1$.

Fig. 3. Performance comparison between the $\ell_{2,1}$-norm and the F-norm.


To investigate the impact of the $\ell_{2,1}$-norm in the framework,
we compare using the $\ell_{2,1}$-norm against replacing it with the
F-norm on the YouTube action dataset. The results in Fig. 3 show
that the $\ell_{2,1}$-norm yields improvements for all numbers
of labeled training data. The results in Fig. 4 verify that our
algorithm benefits from shared structural analysis. The real-world
video dataset, YouTube, is taken as an example to demonstrate
the impact of shared structure learning. We fix $\alpha$ and $\mu_1$ at
their optimal values, i.e., $10^0$ and $10^4$ respectively, for
$10 \times c$ labeled training data. It can be seen that as $\beta$
varies from $10^{-2}$ to 10, the accuracy increases accordingly and
reaches its peak value when $\beta = 10$. Note that a larger $\beta$
means a larger proportion of shared structural consideration in
the holistic framework, and vice versa. When $\beta = 0$, no shared
structure is utilized in the framework. The results demonstrate
that appropriately exploiting the subspace shared by low-level
features can further improve the performance. Specifically, when
the number of labeled training data is $10 \times c$ (the YouTube
action dataset), the extra improvement from the shared structural
learning is 1.0%, while the difference between using the
$\ell_{2,1}$-norm and removing it is 1.5%. Overall, the combination
of graph-based semi-supervised learning, the $\ell_{2,1}$-norm and
shared structural analysis has jointly contributed to the
performance boost of our method.
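
To make the difference between the two norms concrete, the sketch below computes both on a small residual matrix. It assumes the common row-wise convention for the $\ell_{2,1}$-norm (the sum of the Euclidean norms of the rows); the function names and the sample values are illustrative and not taken from the paper.

```python
import math

def l21_norm(matrix):
    """l2,1-norm: sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

def frobenius_norm(matrix):
    """F-norm: square root of the sum of all squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

# Hypothetical residuals: two well-fitted samples and one outlier (last row).
residuals = [[0.1, 0.0], [0.0, 0.1], [3.0, 4.0]]

outlier_share_l21 = 5.0 / l21_norm(residuals)                  # 5.0 / 5.2
outlier_share_f_squared = 25.0 / frobenius_norm(residuals) ** 2  # 25.0 / 25.02
```

Because a loss based on the squared F-norm squares each row's residual, the outlier row accounts for a larger share of that loss than of the $\ell_{2,1}$ loss, which is the usual motivation for using the $\ell_{2,1}$-norm when noise and outliers are expected.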

D.  Discussion 

From the experimental results, the proposed approach, in
which multi-feature learning is integrated into a graph-based
semi-supervised framework, performs action recognition
better than all the compared methods, particularly when
labeled training samples are insufficient. However, the
following facts are still worth noting: 1. Though the
$\ell_{2,1}$-norm loss function improves performance by handling
noise, its optimization requires an iterative algorithm, which is
more expensive than optimizing the F-norm loss function. When
efficiency is a concern, the F-norm can substitute for the
$\ell_{2,1}$-norm in our proposed framework; 2. As indicated in
[39], improved performance is not guaranteed through exploiting
unlabeled data when the manifold assumption does not hold.
Additionally, complementary relationships between different
features may result in performance fluctuations.

V.  Conclusion 

In this paper, we have proposed an approach that exploits
multiple features to categorize human action videos by
exploring the correlations between different visual words. Firstly,
the proposed method simultaneously discovers the intrinsic
relations between visual words in a low-dimensional subspace
to improve the performance of the holistic classification based
on each feature type. Secondly, the $\ell_{2,1}$-norm is applied to
make the framework robust to noise and outliers. Thirdly,
two assumptions have been utilized in the framework: 1) the
label prediction should be consistent with the ground truth
for each feature type; 2) the label prediction for each feature
type should also be consistent with the global prediction using
multiple features. Finally, the framework has been extended to
a semi-supervised setting that exploits both labeled and
unlabeled videos. The framework for action video annotation has
been evaluated on three datasets including both choreographed
and realistic data. The experimental results show that our
approach outperforms all the compared algorithms. The advantage
is especially visible when the amount of labeled training data is
quite small.